Reject non-HTML instead of accepting only HTML #15

Merged: 1 commit, Nov 2, 2020

@chosak (Member) commented Nov 2, 2020

Trying to accept only files that end in .html causes problems when:

  1. Links on a page don't end in a trailing slash (e.g. /foo/bar), so wget interprets the link as being of type "bar" and rejects it (see the comment on "Crawl seems to be missing some pages", #9).
  2. Long URLs get truncated when saved as files and thus no longer end in .html, so wget deletes them (see "Long URLs get truncated by wget", #13).

This change restores the old behavior of providing an explicit reject list instead of accepting only HTML. This is a little suboptimal: it would be nice not to have to maintain a potentially ever-growing list of file extensions, but I'm not sure of a better way to accomplish what we want.
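To illustrate the difference between the two approaches, here is a minimal Python sketch of suffix-based filtering. It is a simplified model, not wget's actual matching logic; the helper names and the `REJECT_SUFFIXES` list are hypothetical and are not the list used in this PR.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical reject list for illustration only; not the PR's actual list.
REJECT_SUFFIXES = (".pdf", ".zip", ".csv")


def last_segment(url):
    """Return the final path segment of a URL, e.g. 'bar' for /foo/bar."""
    return posixpath.basename(urlparse(url).path)


def kept_by_accept_html(url):
    """Accept-only model: keep a URL only if its name ends in .html."""
    return last_segment(url).endswith(".html")


def kept_by_reject_list(url, rejected=REJECT_SUFFIXES):
    """Reject-list model: keep a URL unless its name ends in a listed suffix."""
    return not last_segment(url).endswith(tuple(rejected))


if __name__ == "__main__":
    for url in (
        "https://example.com/foo/bar",          # page link without trailing slash
        "https://example.com/index.html",       # ordinary HTML page
        "https://example.com/data/report.pdf",  # file we do not want
    ):
        print(url, kept_by_accept_html(url), kept_by_reject_list(url))
```

Under this model, the accept-only filter drops /foo/bar because its name does not end in .html, while the reject-list filter keeps it and still drops the PDF.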

@chosak requested a review from @schbetsy on November 2, 2020 20:51

@schbetsy (Collaborator) left a comment

I believe the list also needs to include .csv and maybe .CSV. If you could add those, I think this is good to go after that.
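Since suffix matching is case-sensitive by default, one way to cover both .csv and .CSV is to generate both spellings when building the list. The sketch below assumes the list is passed to wget as a comma-separated string (the format its `--reject` option takes); the function name `build_reject_list` is hypothetical and does not come from this repository.

```python
def build_reject_list(extensions):
    """Build a comma-separated suffix list with lower- and upper-case variants.

    Leading dots are stripped, and duplicates are skipped while
    preserving the input order.
    """
    seen = []
    for ext in extensions:
        ext = ext.lstrip(".")
        for variant in (ext.lower(), ext.upper()):
            if variant not in seen:
                seen.append(variant)
    return ",".join(seen)


if __name__ == "__main__":
    # Example extensions are illustrative, not the PR's actual list.
    print(build_reject_list(["csv", ".pdf"]))
```

wget's manual also documents an `--ignore-case` option that affects accept/reject matching, which may be a simpler alternative if it fits the crawl setup.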

@chosak merged commit 6ef315c into main on Nov 2, 2020
@chosak deleted the reject-not-accept branch on November 2, 2020 22:14